Proyek Analisis Data: E-Commerce Public Dataset¶
- Nama: Doni Maulana Syahputra
- Email: donimaulanas44@gmail.com
- ID Dicoding: maulanadoni
Menentukan Pertanyaan Bisnis¶
- Pertanyaan 1: Bagaimana Performa penjualan pada beberapa tahun terakhir dalam skala per bulan? (performa banyaknya order dan total revenue per month) done
- Pertanyaan 2: Produk apa yang paling menghasilkan revenue dan paling laku serta produk paling tidak laku dan tidak menghasilkan revenue?
- Pertanyaan 3: State apa yang menghasilkan revenue dan jumlah order paling tinggi?
- Pertanyaan 4: Metode pembayaran apa yang paling banyak jumlah pembayarannya dan sering digunakan dari berbagai states?
- Pertanyaan 5: RFM Analysis
- Pertanyaan 6: Bagaimana jika distribusi customer berdasarkan lokasi geografis dalam bentuk map? (Geospatial Analysis)
- Pertanyaan 7: Bagaimana jika mengkategorikan customer berdasarkan jumlah total pengeluaran untuk pembelian produk di platform ini? (Clustering)
Import Semua Packages/Library yang Digunakan¶
In [272]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
print('Setup Complete!')
Setup Complete!
Data Wrangling¶
Gathering Data¶
In [273]:
# import data-data dalam format csv
customers_df = pd.read_csv('customers_dataset.csv')
orders_dataset_df = pd.read_csv('orders_dataset.csv')
order_items_df = pd.read_csv('order_items_dataset.csv')
order_payments_df = pd.read_csv('order_payments_dataset.csv')
order_reviews_df = pd.read_csv('order_reviews_dataset.csv')
products_df = pd.read_csv('products_dataset.csv')
products_english_df = pd.read_csv('product_category_name_translation.csv')
sellers_df = pd.read_csv('sellers_dataset.csv')
geolocation_df = pd.read_csv('geolocation_dataset.csv')
print('Data loaded.')
Data loaded.
In [274]:
customers_df.head()
Out[274]:
| customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | |
|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 9790 | sao bernardo do campo | SP |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 1151 | sao paulo | SP |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 8775 | mogi das cruzes | SP |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP |
In [275]:
orders_dataset_df.head()
Out[275]:
| order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | |
|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 |
In [276]:
order_items_df.head()
Out[276]:
| order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value | |
|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 48436dade18ac8b2bce089ec2a041202 | 2017-09-19 09:45:35 | 58.90 | 13.29 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | dd7ddc04e1b6c2c614352b383efe2d36 | 2017-05-03 11:05:13 | 239.90 | 19.93 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 5b51032eddd242adc84c38acab88f23d | 2018-01-18 14:48:30 | 199.00 | 17.87 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 1 | 7634da152a4610f1595efa32f14722fc | 9d7a1d34a5052409006425275ba1c2b4 | 2018-08-15 10:10:18 | 12.99 | 12.79 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 1 | ac6c3623068f30de03045865e4e10089 | df560393f3a51e74553ab94004ba5c87 | 2017-02-13 13:57:51 | 199.90 | 18.14 |
In [277]:
order_payments_df.head()
Out[277]:
| order_id | payment_sequential | payment_type | payment_installments | payment_value | |
|---|---|---|---|---|---|
| 0 | b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
| 1 | a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
| 2 | 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |
| 3 | ba78997921bbcdc1373bb41e913ab953 | 1 | credit_card | 8 | 107.78 |
| 4 | 42fdf880ba16b47b59251dd489d4441a | 1 | credit_card | 2 | 128.45 |
In [278]:
order_reviews_df.head()
Out[278]:
| review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp | |
|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | NaN | NaN | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | NaN | NaN | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | NaN | NaN | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 | NaN | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 | NaN | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 |
In [279]:
products_df.head()
Out[279]:
| product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumaria | 40.0 | 287.0 | 1.0 | 225.0 | 16.0 | 10.0 | 14.0 |
| 1 | 3aa071139cb16b67ca9e5dea641aaa2f | artes | 44.0 | 276.0 | 1.0 | 1000.0 | 30.0 | 18.0 | 20.0 |
| 2 | 96bd76ec8810374ed1b65e291975717f | esporte_lazer | 46.0 | 250.0 | 1.0 | 154.0 | 18.0 | 9.0 | 15.0 |
| 3 | cef67bcfe19066a932b7673e239eb23d | bebes | 27.0 | 261.0 | 1.0 | 371.0 | 26.0 | 4.0 | 26.0 |
| 4 | 9dc1a7de274444849c219cff195d0b71 | utilidades_domesticas | 37.0 | 402.0 | 4.0 | 625.0 | 20.0 | 17.0 | 13.0 |
In [280]:
products_english_df.head()
Out[280]:
| product_category_name | product_category_name_english | |
|---|---|---|
| 0 | beleza_saude | health_beauty |
| 1 | informatica_acessorios | computers_accessories |
| 2 | automotivo | auto |
| 3 | cama_mesa_banho | bed_bath_table |
| 4 | moveis_decoracao | furniture_decor |
In [281]:
sellers_df.head()
Out[281]:
| seller_id | seller_zip_code_prefix | seller_city | seller_state | |
|---|---|---|---|---|
| 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | campinas | SP |
| 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | mogi guacu | SP |
| 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | rio de janeiro | RJ |
| 3 | c0f3eea2e14555b6faeea3dd58c1b1c3 | 4195 | sao paulo | SP |
| 4 | 51a04a8a6bdcb23deccc82b0b80742cf | 12914 | braganca paulista | SP |
In [282]:
geolocation_df.head()
Out[282]:
| geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state | |
|---|---|---|---|---|---|
| 0 | 1037 | -23.545621 | -46.639292 | sao paulo | SP |
| 1 | 1046 | -23.546081 | -46.644820 | sao paulo | SP |
| 2 | 1046 | -23.546129 | -46.642951 | sao paulo | SP |
| 3 | 1041 | -23.544392 | -46.639499 | sao paulo | SP |
| 4 | 1035 | -23.541578 | -46.641607 | sao paulo | SP |
Insight:
- Pada dataframe order_items_df terdapat paling banyak column dengan foreign key.
- Analisis nanti kita harus men-translate product_category_name ke bahasa inggris dari column product_category_name_english
- Pada tabel geolocation_df, kita bisa analisis dengan library khusus untuk spatial analysis, seperti geopandas.
- Pada tabel order_reviews_df yang kita ambil hanya review_score saja
Assessing Data¶
Check Data Type¶
In [283]:
customers_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 99441 entries, 0 to 99440 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customer_id 99441 non-null object 1 customer_unique_id 99441 non-null object 2 customer_zip_code_prefix 99441 non-null int64 3 customer_city 99441 non-null object 4 customer_state 99441 non-null object dtypes: int64(1), object(4) memory usage: 3.8+ MB
In [284]:
orders_dataset_df.info()
# terlihat dtype tabel yang mengandung informasi waktu masih dalam bentuk object
<class 'pandas.core.frame.DataFrame'> RangeIndex: 99441 entries, 0 to 99440 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 99441 non-null object 1 customer_id 99441 non-null object 2 order_status 99441 non-null object 3 order_purchase_timestamp 99441 non-null object 4 order_approved_at 99281 non-null object 5 order_delivered_carrier_date 97658 non-null object 6 order_delivered_customer_date 96476 non-null object 7 order_estimated_delivery_date 99441 non-null object dtypes: object(8) memory usage: 6.1+ MB
In [285]:
order_items_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 112650 entries, 0 to 112649 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 112650 non-null object 1 order_item_id 112650 non-null int64 2 product_id 112650 non-null object 3 seller_id 112650 non-null object 4 shipping_limit_date 112650 non-null object 5 price 112650 non-null float64 6 freight_value 112650 non-null float64 dtypes: float64(2), int64(1), object(4) memory usage: 6.0+ MB
In [286]:
order_payments_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 103886 entries, 0 to 103885 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 103886 non-null object 1 payment_sequential 103886 non-null int64 2 payment_type 103886 non-null object 3 payment_installments 103886 non-null int64 4 payment_value 103886 non-null float64 dtypes: float64(1), int64(2), object(2) memory usage: 4.0+ MB
In [287]:
order_reviews_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 99224 entries, 0 to 99223 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 review_id 99224 non-null object 1 order_id 99224 non-null object 2 review_score 99224 non-null int64 3 review_comment_title 11568 non-null object 4 review_comment_message 40977 non-null object 5 review_creation_date 99224 non-null object 6 review_answer_timestamp 99224 non-null object dtypes: int64(1), object(6) memory usage: 5.3+ MB
In [288]:
products_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 32951 entries, 0 to 32950 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 product_id 32951 non-null object 1 product_category_name 32341 non-null object 2 product_name_lenght 32341 non-null float64 3 product_description_lenght 32341 non-null float64 4 product_photos_qty 32341 non-null float64 5 product_weight_g 32949 non-null float64 6 product_length_cm 32949 non-null float64 7 product_height_cm 32949 non-null float64 8 product_width_cm 32949 non-null float64 dtypes: float64(7), object(2) memory usage: 2.3+ MB
In [289]:
sellers_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3095 entries, 0 to 3094 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 seller_id 3095 non-null object 1 seller_zip_code_prefix 3095 non-null int64 2 seller_city 3095 non-null object 3 seller_state 3095 non-null object dtypes: int64(1), object(3) memory usage: 96.8+ KB
In [290]:
geolocation_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000163 entries, 0 to 1000162 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 geolocation_zip_code_prefix 1000163 non-null int64 1 geolocation_lat 1000163 non-null float64 2 geolocation_lng 1000163 non-null float64 3 geolocation_city 1000163 non-null object 4 geolocation_state 1000163 non-null object dtypes: float64(2), int64(1), object(2) memory usage: 38.2+ MB
Check Missing Value¶
In [291]:
customers_df.isna().sum()
Out[291]:
customer_id 0 customer_unique_id 0 customer_zip_code_prefix 0 customer_city 0 customer_state 0 dtype: int64
In [292]:
orders_dataset_df.isna().sum()
# terdapat missing value pada beberapa kolom
Out[292]:
order_id 0 customer_id 0 order_status 0 order_purchase_timestamp 0 order_approved_at 160 order_delivered_carrier_date 1783 order_delivered_customer_date 2965 order_estimated_delivery_date 0 dtype: int64
In [293]:
order_items_df.isna().sum()
Out[293]:
order_id 0 order_item_id 0 product_id 0 seller_id 0 shipping_limit_date 0 price 0 freight_value 0 dtype: int64
In [294]:
order_payments_df.isna().sum()
Out[294]:
order_id 0 payment_sequential 0 payment_type 0 payment_installments 0 payment_value 0 dtype: int64
In [295]:
order_reviews_df.isna().sum()
# review_comment_title dan review_comment_message akan di-drop, karena tidak akan dianalisis sentimen di tahap selanjutnya.
Out[295]:
review_id 0 order_id 0 review_score 0 review_comment_title 87656 review_comment_message 58247 review_creation_date 0 review_answer_timestamp 0 dtype: int64
In [296]:
products_df.isna().sum()
# terdapat missing value pada beberapa kolom
Out[296]:
product_id 0 product_category_name 610 product_name_lenght 610 product_description_lenght 610 product_photos_qty 610 product_weight_g 2 product_length_cm 2 product_height_cm 2 product_width_cm 2 dtype: int64
In [297]:
sellers_df.isna().sum()
Out[297]:
seller_id 0 seller_zip_code_prefix 0 seller_city 0 seller_state 0 dtype: int64
In [298]:
geolocation_df.isna().sum()
Out[298]:
geolocation_zip_code_prefix 0 geolocation_lat 0 geolocation_lng 0 geolocation_city 0 geolocation_state 0 dtype: int64
Check Duplicate Value¶
In [299]:
customers_df.duplicated().sum()
Out[299]:
0
In [300]:
orders_dataset_df.duplicated().sum()
Out[300]:
0
In [301]:
order_items_df.duplicated().sum()
Out[301]:
0
In [302]:
order_payments_df.duplicated().sum()
Out[302]:
0
In [303]:
order_reviews_df.duplicated().sum()
Out[303]:
0
In [304]:
products_df.duplicated().sum()
Out[304]:
0
In [305]:
sellers_df.duplicated().sum()
Out[305]:
0
Check Inaccurate & Inconsistent Value¶
In [306]:
customers_df.customer_state.value_counts()
Out[306]:
customer_state SP 41746 RJ 12852 MG 11635 RS 5466 PR 5045 SC 3637 BA 3380 DF 2140 ES 2033 GO 2020 PE 1652 CE 1336 PA 975 MT 907 MA 747 MS 715 PB 536 PI 495 RN 485 AL 413 SE 350 TO 280 RO 253 AM 148 AC 81 AP 68 RR 46 Name: count, dtype: int64
In [307]:
orders_dataset_df['order_status'].value_counts()
# hanya order_status yang delivered yang akan dianalisis di tahap selanjutnya
Out[307]:
order_status delivered 96478 shipped 1107 canceled 625 unavailable 609 invoiced 314 processing 301 created 5 approved 2 Name: count, dtype: int64
In [308]:
order_payments_df['payment_type'].value_counts()
Out[308]:
payment_type credit_card 76795 boleto 19784 voucher 5775 debit_card 1529 not_defined 3 Name: count, dtype: int64
In [309]:
order_payments_df[order_payments_df['payment_type']=='not_defined']
Out[309]:
| order_id | payment_sequential | payment_type | payment_installments | payment_value | |
|---|---|---|---|---|---|
| 51280 | 4637ca194b6387e2d538dc89b124b0ee | 1 | not_defined | 1 | 0.0 |
| 57411 | 00b1cb0320190ca0daa2c88b35206009 | 1 | not_defined | 1 | 0.0 |
| 94427 | c8c528189310eaa44a745b8d9d26908b | 1 | not_defined | 1 | 0.0 |
| Data Type | Missing Value | Duplicate Value | Inaccurate Value | |
|---|---|---|---|---|
| customers_df | ||||
| orders_dataset_df | order_purchase_timestamp | order_approved_at = 160 | ||
| order_approved_at | order_delivered_carrier_date = 1783 | |||
| order_delivered_carrier_date | order_delivered_customer_date = 2 | |||
| order_delivered_customer_date | ||||
| order_estimated_delivery_date | ||||
| order_items_df | ||||
| order_payments_df | ||||
| order_reviews_df | review_comment_title 87656 | |||
| review_comment_message 58247 | ||||
| products_df | product_category_name = 610 | |||
| product_name_lenght = 610 | ||||
| product_description_lenght = 610 | ||||
| product_photos_qty = 610 | ||||
| product_weight_g = 2 | ||||
| product_length_cm = 2 | ||||
| product_height_cm = 2 | ||||
| product_width_cm = 2 | ||||
| products_english_df | ||||
| sellers_df | ||||
| geolocation_df |
Insight:
- Terdapat kesalahan data type pada table orders_dataset_df
- Terdapat missing value pada table orders_dataset_df, order_reviews_df, dan products_df
- Pada kolom 'order_status' di tabel orders_dataset_df, hanya status yang delivered yang akan dianalisis nantinya, selain itu akan di-drop.
- Pada tabel order_reviews_df, hanya rating score yang akan dianalisis lebih lanjut, sisanya akan didrop seperti comment title, message, etc.
Cleaning Data¶
orders_dataset_df¶
In [310]:
orders_dataset_df = orders_dataset_df[orders_dataset_df['order_status'] == 'delivered'].reset_index(drop=True)
orders_dataset_df.head()
Out[310]:
| order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | |
|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 |
In [311]:
drop_columns = ['order_approved_at','order_delivered_carrier_date', 'order_delivered_customer_date']
orders_dataset_df.drop(drop_columns, axis=1, inplace=True)
orders_dataset_df.head()
#Orders_dataset selesai
Out[311]:
| order_id | customer_id | order_status | order_purchase_timestamp | order_estimated_delivery_date | |
|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-18 00:00:00 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-08-13 00:00:00 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-09-04 00:00:00 |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-12-15 00:00:00 |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-26 00:00:00 |
In [312]:
columns_to_datetime = ['order_purchase_timestamp', 'order_estimated_delivery_date']
for column in columns_to_datetime:
orders_dataset_df[column] = pd.to_datetime(orders_dataset_df[column])
In [313]:
orders_dataset_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 96478 entries, 0 to 96477 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 96478 non-null object 1 customer_id 96478 non-null object 2 order_status 96478 non-null object 3 order_purchase_timestamp 96478 non-null datetime64[ns] 4 order_estimated_delivery_date 96478 non-null datetime64[ns] dtypes: datetime64[ns](2), object(3) memory usage: 3.7+ MB
order_reviews_df¶
In [314]:
order_reviews_df.head()
drop_columns = ['review_comment_message', 'review_comment_title', 'review_creation_date', 'review_answer_timestamp']
order_reviews_df.drop(drop_columns, axis=1, inplace=True)
order_reviews_df.head()
Out[314]:
| review_id | order_id | review_score | |
|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 |
In [315]:
order_reviews_df.head()
Out[315]:
| review_id | order_id | review_score | |
|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 |
products_df¶
In [316]:
# We need to translate the words from column product_category_name in products_df table to english to get a better understanding at the data.
products_df= pd.merge(left=products_df, right=products_english_df, how='left', on='product_category_name')
In [317]:
products_df.head()
Out[317]:
| product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm | product_category_name_english | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumaria | 40.0 | 287.0 | 1.0 | 225.0 | 16.0 | 10.0 | 14.0 | perfumery |
| 1 | 3aa071139cb16b67ca9e5dea641aaa2f | artes | 44.0 | 276.0 | 1.0 | 1000.0 | 30.0 | 18.0 | 20.0 | art |
| 2 | 96bd76ec8810374ed1b65e291975717f | esporte_lazer | 46.0 | 250.0 | 1.0 | 154.0 | 18.0 | 9.0 | 15.0 | sports_leisure |
| 3 | cef67bcfe19066a932b7673e239eb23d | bebes | 27.0 | 261.0 | 1.0 | 371.0 | 26.0 | 4.0 | 26.0 | baby |
| 4 | 9dc1a7de274444849c219cff195d0b71 | utilidades_domesticas | 37.0 | 402.0 | 4.0 | 625.0 | 20.0 | 17.0 | 13.0 | housewares |
In [318]:
# Since only 610 rows that have null values, removing these rows won't have significant impact to the analysis
products_df.dropna(inplace=True)
order_payments_df¶
In [319]:
order_payments_df.drop(order_payments_df.loc[order_payments_df['payment_type']=='not_defined'].index, inplace=True)
In [320]:
order_payments_df[order_payments_df['payment_type']=='not_defined']
Out[320]:
| order_id | payment_sequential | payment_type | payment_installments | payment_value |
|---|
In [321]:
order_payments_df.groupby('payment_type').count()
Out[321]:
| order_id | payment_sequential | payment_installments | payment_value | |
|---|---|---|---|---|
| payment_type | ||||
| boleto | 19784 | 19784 | 19784 | 19784 |
| credit_card | 76795 | 76795 | 76795 | 76795 |
| debit_card | 1529 | 1529 | 1529 | 1529 |
| voucher | 5775 | 5775 | 5775 | 5775 |
Insight:
- semua tabel sudah siap dan tahap selanjutnya harus di merge dengan tabel yang berkaitan agar bisa memunculkan insight
- geolocation akan dianalisis di akhir
Exploratory Data Analysis (EDA)¶
Explore customers_df¶
In [322]:
customers_df.describe(include="all")
Out[322]:
| customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | |
|---|---|---|---|---|---|
| count | 99441 | 99441 | 99441.000000 | 99441 | 99441 |
| unique | 99441 | 96096 | NaN | 4119 | 27 |
| top | 06b8999e2fba1a1fbc88172c00ba8bc7 | 8d50f5eadf50201ccdcedfb9e2ac8455 | NaN | sao paulo | SP |
| freq | 1 | 17 | NaN | 15540 | 41746 |
| mean | NaN | NaN | 35137.474583 | NaN | NaN |
| std | NaN | NaN | 29797.938996 | NaN | NaN |
| min | NaN | NaN | 1003.000000 | NaN | NaN |
| 25% | NaN | NaN | 11347.000000 | NaN | NaN |
| 50% | NaN | NaN | 24416.000000 | NaN | NaN |
| 75% | NaN | NaN | 58900.000000 | NaN | NaN |
| max | NaN | NaN | 99990.000000 | NaN | NaN |
In [323]:
# jumlah customer dari berbagai state
customers_df.groupby('customer_state')['customer_id'].nunique().sort_values(ascending=False)
Out[323]:
customer_state SP 41746 RJ 12852 MG 11635 RS 5466 PR 5045 SC 3637 BA 3380 DF 2140 ES 2033 GO 2020 PE 1652 CE 1336 PA 975 MT 907 MA 747 MS 715 PB 536 PI 495 RN 485 AL 413 SE 350 TO 280 RO 253 AM 148 AC 81 AP 68 RR 46 Name: customer_id, dtype: int64
In [324]:
# jumlah customer dari berbagai city
customers_df.groupby('customer_city')['customer_id'].nunique().sort_values(ascending=False)
Out[324]:
customer_city
sao paulo 15540
rio de janeiro 6882
belo horizonte 2773
brasilia 2131
curitiba 1521
...
ibiara 1
rio espera 1
rio dos indios 1
rio dos cedros 1
lagoao 1
Name: customer_id, Length: 4119, dtype: int64
Explore orders_dataset_df¶
In [325]:
# membuat variabel delivery time
delivery_time = orders_dataset_df["order_estimated_delivery_date"] - orders_dataset_df["order_purchase_timestamp"]
delivery_time = delivery_time.apply(lambda x: x.total_seconds())
orders_dataset_df["delivery_time"] = round(delivery_time/86400)
In [326]:
orders_dataset_df.describe(include="all")
Out[326]:
| order_id | customer_id | order_status | order_purchase_timestamp | order_estimated_delivery_date | delivery_time | |
|---|---|---|---|---|---|---|
| count | 96478 | 96478 | 96478 | 96478 | 96478 | 96478.000000 |
| unique | 96478 | 96478 | 1 | NaN | NaN | NaN |
| top | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | NaN | NaN | NaN |
| freq | 1 | 1 | 96478 | NaN | NaN | NaN |
| mean | NaN | NaN | NaN | 2018-01-01 23:29:31.939913984 | 2018-01-25 17:09:52.325711616 | 23.644188 |
| min | NaN | NaN | NaN | 2016-09-15 12:16:38 | 2016-10-04 00:00:00 | 2.000000 |
| 25% | NaN | NaN | NaN | 2017-09-14 09:00:23.249999872 | 2017-10-05 00:00:00 | 18.000000 |
| 50% | NaN | NaN | NaN | 2018-01-20 19:45:45 | 2018-02-16 00:00:00 | 23.000000 |
| 75% | NaN | NaN | NaN | 2018-05-05 18:54:47 | 2018-05-28 00:00:00 | 28.000000 |
| max | NaN | NaN | NaN | 2018-08-29 15:00:37 | 2018-10-25 00:00:00 | 155.000000 |
| std | NaN | NaN | NaN | NaN | NaN | 8.769316 |
Explore orders_customers_df¶
In [327]:
# menambah column baru yaitu status untuk mengetahui status user aktif atau non aktif
customers_id_in_orders_df = orders_dataset_df.customer_id.tolist()
customers_df['status'] = customers_df['customer_id'].apply(lambda x: 'Active' if x in customers_id_in_orders_df else 'Non Active')
customers_df.sample(5)
Out[327]:
| customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | status | |
|---|---|---|---|---|---|---|
| 45346 | 563c5a6fccb510f26534e34cc177b87e | 60d2403e6d0b05d5213ae7d7d9b70080 | 4363 | sao paulo | SP | Active |
| 55751 | 60c9c9568fdcbbccfc7ddc387f76e200 | 5cdbff27ae9bb0812d23e36f517feb23 | 29208 | guarapari | ES | Active |
| 38777 | 10e4208d1da23cd8f6ee0477f6b5cc9a | af32334783564756b0928aac81c2606b | 28930 | arraial do cabo | RJ | Active |
| 26549 | c0c563d66d3d0182b86b18db6eecda05 | 4c4cb7298a837849426bea982a676ae8 | 14811 | araraquara | SP | Active |
| 65244 | 735c9db72feaec11cefd42d50c2b7f66 | 6267d5c4c84591fa44e9c56c4e31af20 | 11660 | caraguatatuba | SP | Active |
In [328]:
customers_df.groupby('status').customer_id.count()
Out[328]:
status Active 96478 Non Active 2963 Name: customer_id, dtype: int64
In [329]:
orders_customers_df = pd.merge(left=orders_dataset_df, right=customers_df, how='left', on='customer_id')
orders_customers_df.head()
Out[329]:
| order_id | customer_id | order_status | order_purchase_timestamp | order_estimated_delivery_date | delivery_time | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-18 | 16.0 | 7c396fd4830fd04220f754e42b4e5bff | 3149 | sao paulo | SP | Active |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-08-13 | 19.0 | af07308b275d755c9edb36a90c618231 | 47813 | barreiras | BA | Active |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-09-04 | 27.0 | 3a653a41f6f9fc3d2a113cf8398680e8 | 75265 | vianopolis | GO | Active |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-12-15 | 26.0 | 7c142cf63193a1473d2e66489a9ae977 | 59296 | sao goncalo do amarante | RN | Active |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-26 | 12.0 | 72632f0f9dd73dfee390c9b22eb56dd6 | 9195 | santo andre | SP | Active |
In [330]:
# Jumlah order berdasarkan state
orders_customers_df.groupby('customer_state').order_id.nunique().sort_values(ascending=False).reset_index().head(10)
Out[330]:
| customer_state | order_id | |
|---|---|---|
| 0 | SP | 40501 |
| 1 | RJ | 12350 |
| 2 | MG | 11354 |
| 3 | RS | 5345 |
| 4 | PR | 4923 |
| 5 | SC | 3546 |
| 6 | BA | 3256 |
| 7 | DF | 2080 |
| 8 | ES | 1995 |
| 9 | GO | 1957 |
In [331]:
# Jumlah order berdasarkan city
orders_customers_df.groupby(by="customer_city").order_id.nunique().sort_values(ascending=False).reset_index().head(10)
Out[331]:
| customer_city | order_id | |
|---|---|---|
| 0 | sao paulo | 15045 |
| 1 | rio de janeiro | 6601 |
| 2 | belo horizonte | 2697 |
| 3 | brasilia | 2071 |
| 4 | curitiba | 1489 |
| 5 | campinas | 1406 |
| 6 | porto alegre | 1342 |
| 7 | salvador | 1188 |
| 8 | guarulhos | 1144 |
| 9 | sao bernardo do campo | 911 |
Explore order_payment_df¶
In [332]:
order_payments_df.describe(include='all')
Out[332]:
| order_id | payment_sequential | payment_type | payment_installments | payment_value | |
|---|---|---|---|---|---|
| count | 103883 | 103883.000000 | 103883 | 103883.000000 | 103883.000000 |
| unique | 99437 | NaN | 4 | NaN | NaN |
| top | fa65dad1b0e818e3ccc5cb0e39231352 | NaN | credit_card | NaN | NaN |
| freq | 29 | NaN | 76795 | NaN | NaN |
| mean | NaN | 1.092681 | NaN | 2.853402 | 154.104831 |
| std | NaN | 0.706594 | NaN | 2.687071 | 217.495628 |
| min | NaN | 1.000000 | NaN | 0.000000 | 0.000000 |
| 25% | NaN | 1.000000 | NaN | 1.000000 | 56.800000 |
| 50% | NaN | 1.000000 | NaN | 1.000000 | 100.000000 |
| 75% | NaN | 1.000000 | NaN | 4.000000 | 171.840000 |
| max | NaN | 29.000000 | NaN | 24.000000 | 13664.080000 |
In [333]:
order_payments_df.groupby('payment_type').agg({
'payment_installments' : 'sum',
'payment_value' : 'sum'
}).sort_values(by='payment_value',ascending=False)
# Credit card merupakan metode pembayaran paling banyak, lalu disusul dengan boleto
Out[333]:
| payment_installments | payment_value | |
|---|---|---|
| payment_type | ||
| credit_card | 269332 | 12542084.19 |
| boleto | 19784 | 2869361.27 |
| voucher | 5775 | 379436.87 |
| debit_card | 1529 | 217989.79 |
Explore orders_customers_payments_df¶
In [334]:
orders_customers_payments_df = pd.merge(left=orders_customers_df, right=order_payments_df, how='left', on='order_id')
orders_customers_payments_df.head()
Out[334]:
| order_id | customer_id | order_status | order_purchase_timestamp | order_estimated_delivery_date | delivery_time | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | status | payment_sequential | payment_type | payment_installments | payment_value | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-18 | 16.0 | 7c396fd4830fd04220f754e42b4e5bff | 3149 | sao paulo | SP | Active | 1.0 | credit_card | 1.0 | 18.12 |
| 1 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-18 | 16.0 | 7c396fd4830fd04220f754e42b4e5bff | 3149 | sao paulo | SP | Active | 3.0 | voucher | 1.0 | 2.00 |
| 2 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-18 | 16.0 | 7c396fd4830fd04220f754e42b4e5bff | 3149 | sao paulo | SP | Active | 2.0 | voucher | 1.0 | 18.59 |
| 3 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-08-13 | 19.0 | af07308b275d755c9edb36a90c618231 | 47813 | barreiras | BA | Active | 1.0 | boleto | 1.0 | 141.46 |
| 4 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-09-04 | 27.0 | 3a653a41f6f9fc3d2a113cf8398680e8 | 75265 | vianopolis | GO | Active | 1.0 | credit_card | 3.0 | 179.12 |
In [335]:
orders_customers_payments_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 100757 entries, 0 to 100756 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 100757 non-null object 1 customer_id 100757 non-null object 2 order_status 100757 non-null object 3 order_purchase_timestamp 100757 non-null datetime64[ns] 4 order_estimated_delivery_date 100757 non-null datetime64[ns] 5 delivery_time 100757 non-null float64 6 customer_unique_id 100757 non-null object 7 customer_zip_code_prefix 100757 non-null int64 8 customer_city 100757 non-null object 9 customer_state 100757 non-null object 10 status 100757 non-null object 11 payment_sequential 100756 non-null float64 12 payment_type 100756 non-null object 13 payment_installments 100756 non-null float64 14 payment_value 100756 non-null float64 dtypes: datetime64[ns](2), float64(4), int64(1), object(8) memory usage: 11.5+ MB
In [336]:
orders_customers_payments_df.isna().sum()
Out[336]:
order_id 0 customer_id 0 order_status 0 order_purchase_timestamp 0 order_estimated_delivery_date 0 delivery_time 0 customer_unique_id 0 customer_zip_code_prefix 0 customer_city 0 customer_state 0 status 0 payment_sequential 1 payment_type 1 payment_installments 1 payment_value 1 dtype: int64
In [337]:
orders_customers_payments_df[orders_customers_payments_df.isna().any(axis=1)]
#Drop 1 rows 4 values kayaknya tidak akan berpengaruh secara signifikan dari keseluruhan data.
Out[337]:
| order_id | customer_id | order_status | order_purchase_timestamp | order_estimated_delivery_date | delivery_time | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | status | payment_sequential | payment_type | payment_installments | payment_value | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31175 | bfbd0f9bdef84302105ad712db648a6c | 86dc2ffce2dfff336de2f386a786e574 | delivered | 2016-09-15 12:16:38 | 2016-10-04 | 18.0 | 830d5b7aaa3b6f1e9ad63703bec97d23 | 14600 | sao joaquim da barra | SP | Active | NaN | NaN | NaN | NaN |
In [338]:
orders_customers_payments_df.dropna(inplace=True)
In [339]:
orders_customers_payments_df.duplicated().sum()
Out[339]:
0
Explore order_customers_payments_reviews_df¶
In [340]:
order_customers_payments_reviews_df = pd.merge(left=orders_customers_payments_df, right=order_reviews_df, how='left', on='order_id')
order_customers_payments_reviews_df.head()
Out[340]:
| order_id | customer_id | order_status | order_purchase_timestamp | order_estimated_delivery_date | delivery_time | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | status | payment_sequential | payment_type | payment_installments | payment_value | review_id | review_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-18 | 16.0 | 7c396fd4830fd04220f754e42b4e5bff | 3149 | sao paulo | SP | Active | 1.0 | credit_card | 1.0 | 18.12 | a54f0611adc9ed256b57ede6b6eb5114 | 4.0 |
| 1 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-18 | 16.0 | 7c396fd4830fd04220f754e42b4e5bff | 3149 | sao paulo | SP | Active | 3.0 | voucher | 1.0 | 2.00 | a54f0611adc9ed256b57ede6b6eb5114 | 4.0 |
| 2 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-18 | 16.0 | 7c396fd4830fd04220f754e42b4e5bff | 3149 | sao paulo | SP | Active | 2.0 | voucher | 1.0 | 18.59 | a54f0611adc9ed256b57ede6b6eb5114 | 4.0 |
| 3 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-08-13 | 19.0 | af07308b275d755c9edb36a90c618231 | 47813 | barreiras | BA | Active | 1.0 | boleto | 1.0 | 141.46 | 8d5266042046a06655c8db133d120ba5 | 4.0 |
| 4 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-09-04 | 27.0 | 3a653a41f6f9fc3d2a113cf8398680e8 | 75265 | vianopolis | GO | Active | 1.0 | credit_card | 3.0 | 179.12 | e73b67b67587f7644d5bd1a52deb1b01 | 5.0 |
In [341]:
order_customers_payments_reviews_df.isna().sum()
Out[341]:
order_id 0 customer_id 0 order_status 0 order_purchase_timestamp 0 order_estimated_delivery_date 0 delivery_time 0 customer_unique_id 0 customer_zip_code_prefix 0 customer_city 0 customer_state 0 status 0 payment_sequential 0 payment_type 0 payment_installments 0 payment_value 0 review_id 676 review_score 676 dtype: int64
In [342]:
order_customers_payments_reviews_df.dropna(inplace=True)
In [343]:
order_customers_payments_reviews_df.groupby('review_score')['order_id'].nunique().reset_index().sort_values(by='order_id', ascending=False)
Out[343]:
| review_score | order_id | |
|---|---|---|
| 4 | 5.0 | 56817 |
| 3 | 4.0 | 18943 |
| 0 | 1.0 | 9380 |
| 2 | 3.0 | 7942 |
| 1 | 2.0 | 2938 |
Explore order_items_df dan products_df¶
In [344]:
# Kita harus langsung merge dengan products_df agar mengetahui nama dari produk-produk didalam table order_items_df.
product_order_items_df = pd.merge(left=order_items_df, right=products_df, how='left', on='product_id')
product_order_items_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 112650 entries, 0 to 112649 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 112650 non-null object 1 order_item_id 112650 non-null int64 2 product_id 112650 non-null object 3 seller_id 112650 non-null object 4 shipping_limit_date 112650 non-null object 5 price 112650 non-null float64 6 freight_value 112650 non-null float64 7 product_category_name 111022 non-null object 8 product_name_lenght 111022 non-null float64 9 product_description_lenght 111022 non-null float64 10 product_photos_qty 111022 non-null float64 11 product_weight_g 111022 non-null float64 12 product_length_cm 111022 non-null float64 13 product_height_cm 111022 non-null float64 14 product_width_cm 111022 non-null float64 15 product_category_name_english 111022 non-null object dtypes: float64(9), int64(1), object(6) memory usage: 13.8+ MB
In [345]:
product_order_items_df.isna().sum()
Out[345]:
order_id 0 order_item_id 0 product_id 0 seller_id 0 shipping_limit_date 0 price 0 freight_value 0 product_category_name 1628 product_name_lenght 1628 product_description_lenght 1628 product_photos_qty 1628 product_weight_g 1628 product_length_cm 1628 product_height_cm 1628 product_width_cm 1628 product_category_name_english 1628 dtype: int64
In [346]:
# There are so many missing values, but removing these null values still not significantly impact the analysis
product_order_items_df.dropna(inplace=True)
In [347]:
# Jumlah revenue dari berbagai kategori produk
product_order_items_df.groupby('product_category_name_english').price.sum().reset_index().sort_values(by='price', ascending=False)
Out[347]:
| product_category_name_english | price | |
|---|---|---|
| 43 | health_beauty | 1258681.34 |
| 70 | watches_gifts | 1205005.68 |
| 7 | bed_bath_table | 1036988.68 |
| 65 | sports_leisure | 988048.97 |
| 15 | computers_accessories | 911954.32 |
| ... | ... | ... |
| 35 | flowers | 1110.04 |
| 46 | home_comfort_2 | 760.27 |
| 11 | cds_dvds_musicals | 730.00 |
| 29 | fashion_childrens_clothes | 569.85 |
| 61 | security_and_services | 283.29 |
71 rows × 2 columns
In [348]:
# Jumlah biaya pengiriman dari kategori produk yang paling besar
product_order_items_df.groupby('product_category_name_english').freight_value.max().reset_index().sort_values(by='freight_value', ascending=False)
Out[348]:
| product_category_name_english | freight_value | |
|---|---|---|
| 6 | baby | 409.68 |
| 49 | housewares | 375.28 |
| 43 | health_beauty | 338.30 |
| 17 | construction_tools_construction | 321.46 |
| 50 | industry_commerce_and_business | 317.47 |
| ... | ... | ... |
| 61 | security_and_services | 25.77 |
| 35 | flowers | 22.93 |
| 27 | fashio_female_clothing | 22.84 |
| 46 | home_comfort_2 | 19.36 |
| 29 | fashion_childrens_clothes | 17.07 |
71 rows × 2 columns
Merge all dataset¶
In [349]:
all_df = pd.merge(left=order_customers_payments_reviews_df, right=product_order_items_df, how='left', on='order_id')
In [350]:
all_df.isna().sum()
# Banyak missing value karena pada tabel order_items_df, semua order_status masih terbawa, termasuk canceled, shipped, dll. Sehingga di-drop saja mising value tersebut.
Out[350]:
order_id 0 customer_id 0 order_status 0 order_purchase_timestamp 0 order_estimated_delivery_date 0 delivery_time 0 customer_unique_id 0 customer_zip_code_prefix 0 customer_city 0 customer_state 0 status 0 payment_sequential 0 payment_type 0 payment_installments 0 payment_value 0 review_id 0 review_score 0 order_item_id 1433 product_id 1433 seller_id 1433 shipping_limit_date 1433 price 1433 freight_value 1433 product_category_name 1433 product_name_lenght 1433 product_description_lenght 1433 product_photos_qty 1433 product_weight_g 1433 product_length_cm 1433 product_height_cm 1433 product_width_cm 1433 product_category_name_english 1433 dtype: int64
In [351]:
all_df.dropna(inplace=True)
all_df.reset_index(inplace=True)
In [352]:
all_df = pd.merge(left=all_df, right=sellers_df, how='left', on='seller_id')
# Sekarang sudah selesai, lalu ke tahap visualisasi
In [353]:
all_df.groupby('product_category_name_english').agg({
'price' : 'max',
'freight_value' : 'max',
'payment_value' : 'sum'
}).sort_values(by='payment_value', ascending=False)
Out[353]:
| price | freight_value | payment_value | |
|---|---|---|---|
| product_category_name_english | |||
| bed_bath_table | 1999.98 | 225.71 | 1707029.45 |
| health_beauty | 3124.00 | 338.30 | 1611835.65 |
| computers_accessories | 3699.99 | 134.17 | 1557712.98 |
| furniture_decor | 1899.00 | 215.43 | 1391930.08 |
| watches_gifts | 3999.90 | 209.63 | 1380487.85 |
| ... | ... | ... | ... |
| flowers | 65.90 | 22.93 | 1922.77 |
| home_comfort_2 | 219.99 | 19.36 | 1458.54 |
| cds_dvds_musicals | 65.00 | 52.58 | 1199.43 |
| fashion_childrens_clothes | 110.00 | 17.07 | 718.98 |
| security_and_services | 183.29 | 25.77 | 324.51 |
71 rows × 3 columns
In [354]:
all_df.groupby('product_category_name_english').agg({
'delivery_time' : 'max',
}).sort_values(by='delivery_time', ascending=False)
Out[354]:
| delivery_time | |
|---|---|
| product_category_name_english | |
| furniture_decor | 155.0 |
| market_place | 146.0 |
| housewares | 140.0 |
| art | 116.0 |
| bed_bath_table | 109.0 |
| ... | ... |
| home_comfort_2 | 34.0 |
| flowers | 34.0 |
| fashion_childrens_clothes | 31.0 |
| la_cuisine | 28.0 |
| arts_and_craftmanship | 25.0 |
71 rows × 1 columns
In [355]:
all_df.groupby('product_category_name_english').agg({
'review_score' : 'mean'
}).sort_values(by='review_score', ascending=False)
Out[355]:
| review_score | |
|---|---|
| product_category_name_english | |
| fashion_childrens_clothes | 5.000000 |
| cds_dvds_musicals | 4.642857 |
| books_imported | 4.525424 |
| books_general_interest | 4.502752 |
| small_appliances_home_oven_and_coffee | 4.453333 |
| ... | ... |
| fashion_male_clothing | 3.649635 |
| home_comfort_2 | 3.642857 |
| office_furniture | 3.553204 |
| diapers_and_hygiene | 3.378378 |
| security_and_services | 2.500000 |
71 rows × 1 columns
Insight:
- all_df diperlukan guna melihat insight dari data customer dengan produk kategori
- geolocation_df akan dimerge pada bagian geospatial analysis
- payment_value merupakan hasil dari freight_value + price
- review produk tertinggi dari kategori produk fashion children, lalu cds_dvds_musical, dan books_imported (analisis ini tidak melihat jumalah nilai rating dari setiap kategori produk)
- delivery_time terlama berada di produk furniture_decor, market_place, dan housewares.
Visualization & Explanatory Analysis¶
- Pertanyaan 1: What Bagaimana Performa penjualan pada beberapa tahun terakhir dalam skala per bulan? (performa banyaknya order dan total revenue per month) done
- Pertanyaan 2: Produk apa yang paling menghasilkan revenue dan paling laku serta produk paling tidak laku dan tidak menghasilkan revenue?
- Pertanyaan 3: State apa yang menghasilkan revenue dan jumlah order paling tinggi?
- Pertanyaan 4: Metode pembayaran apa yang paling banyak jumlah pembayarannya dan sering digunakan dari berbagai states?
- Pertanyaan 5: RFM Analysis
- Pertanyaan 6: Bagaimana jika distribusi customer berdasarkan lokasi geografis dalam bentuk map? (Geospatial Analysis)
- Pertanyaan 7: Bagaimana jika mengkategorikan customer berdasarkan jumlah total pengeluaran untuk pembelian produk di platform ini? (Clustering)
Pertanyaan 1: Bagaimana performa penjualan pada beberapa tahun terakhir dalam skala per bulan? (performa banyaknya order dan total revenue per month)¶
In [356]:
monthly_orders_df = all_df.resample(rule='M', on='order_purchase_timestamp').agg({
"order_id": "nunique",
"payment_value": "sum"
})
monthly_orders_df.index = monthly_orders_df.index.strftime('%b %y') #mengubah format order date menjadi nama bulan
monthly_orders_df = monthly_orders_df.reset_index()
monthly_orders_df.rename(columns={
"order_id": "order_count",
"payment_value": "revenue"
}, inplace=True)
monthly_orders_df.head().sort_values(by='revenue', ascending=False)
plt.figure(figsize=(10, 5))
plt.plot(monthly_orders_df["order_purchase_timestamp"], monthly_orders_df["order_count"], marker='o', linewidth=2, color="#72BCD4")
plt.title("Total Orders per Month (Oct 2016- Aug 2018)", loc="center", fontsize=18)
plt.xticks(fontsize=10, rotation=45)
plt.yticks(fontsize=10)
plt.grid(linestyle='--', alpha= 0.6, which='major', axis='y')
plt.grid(linestyle='dotted', alpha= 0.6, which='major', axis='x')
plt.show()
C:\Users\Doni\AppData\Local\Temp\ipykernel_8616\3260640141.py:1: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
monthly_orders_df = all_df.resample(rule='M', on='order_purchase_timestamp').agg({
In [357]:
from matplotlib.ticker import FuncFormatter
def format_with_units(x, pos):
if x >= 1e6:
return f'{x/1e6:.1f}M'
elif x >= 1e3:
return f'{x/1e3:.1f}K'
else:
return f'{x:.0f}'
plt.figure(figsize=(10, 5))
plt.plot(
monthly_orders_df['order_purchase_timestamp'],
monthly_orders_df['revenue'],
marker ='o',
color = 'seagreen',
linewidth=2
)
plt.title("Total Revenues per Month (Oct 2016- Aug 2018)", loc="center", fontsize=18)
plt.xticks(fontsize=10, rotation=45)
plt.yticks(fontsize=10)
plt.gca().yaxis.set_major_formatter(FuncFormatter(format_with_units))
plt.grid(linestyle='--', alpha= 0.6, which='major', axis='y')
plt.grid(linestyle='dotted', alpha= 0.6, which='major', axis='x')
plt.show()
In [358]:
sum_orders_items_df = all_df.groupby(['product_category_name_english']).agg({
'order_item_id' :'sum'
}).sort_values(by='order_item_id', ascending=False).reset_index()
sum_orders_revenues_df = all_df.groupby(['product_category_name_english']).agg({
'payment_value' :'sum'
}).sort_values(by='payment_value', ascending=False).reset_index()
Pertanyaan 2: Produk apa yang paling menghasilkan revenue dan paling laku serta produk paling tidak laku dan tidak menghasilkan revenue?¶
In [359]:
# Pola korelasi dari jumlah banyaknya order dengan jumlah revenue yang didapat dalam beberapa kategori produk
max_order_items = sum_orders_items_df['order_item_id'].max()
max_payment = sum_orders_revenues_df['payment_value'].max()
max_x_order = max_order_items * 1.1
max_x_payment = max_payment * 1.1
fig, ax = plt.subplots(2,2, figsize=(35,20))
sns.barplot(y='product_category_name_english', x='order_item_id', data=sum_orders_items_df.head(15), ax=ax[0, 0],hue="order_item_id")
ax[0, 0].set_ylabel(None)
ax[0, 0].tick_params(axis ='both', labelsize=14)
ax[0, 0].set_xlabel('Sum of order items')
ax[0, 0].set_title("15 Best Performing Product Based on Order Items", loc="center", fontsize=20)
ax[0, 0].set_xlim(0, max_x_order)
sns.barplot(y='product_category_name_english', x='payment_value', data=sum_orders_revenues_df.head(15), ax=ax[0, 1], hue='payment_value')
ax[0, 1].set_ylabel(None)
ax[0, 1].set_xlabel('Sum of Revenues in millions')
ax[0, 1].tick_params(axis ='both', labelsize=14)
ax[0, 1].set_title("15 Best Performing Product Based on Revenues", loc="center", fontsize=20)
ax[0, 1].set_xlim(0, max_x_payment)
sns.barplot(y='product_category_name_english', x='order_item_id', data=sum_orders_items_df.tail(20).sort_values(by='order_item_id',ascending=False), ax=ax[1, 0], color= '#e9d4d0')
ax[1, 0].set_ylabel(None)
ax[1, 0].set_xlabel('Sum of order items')
ax[1, 0].tick_params(axis= 'both', labelsize=14)
ax[1, 0].set_title("20 Worst Performing Product Based on Order Items", loc="center", fontsize=20)
ax[1, 0].set_xlim(0, max_x_order)
sns.barplot(y='product_category_name_english', x='payment_value', data=sum_orders_revenues_df.tail(20).sort_values(by='payment_value',ascending=False), ax=ax[1, 1], color= '#e9d4d0')
ax[1, 1].set_ylabel(None)
ax[1, 1].set_xlabel('Sum of Revenues in millions')
ax[1, 1].tick_params(axis ='both', labelsize=14)
ax[1, 1].set_title("20 Worst Performing Product Based on Revenues", loc="center", fontsize=20)
ax[1, 1].set_xlim(0, max_x_payment)
plt.show()
Pertanyaan 3: State apa yang menghasilkan revenue dan jumlah order paling tinggi?¶
In [360]:
sum_orders_items_state_df = all_df.groupby(['customer_state']).agg({
'order_item_id' :'sum'
}).sort_values(by='order_item_id', ascending=False).reset_index()
sum_orders_revenues_state_df = all_df.groupby(['customer_state']).agg({
'payment_value' :'sum'
}).sort_values(by='payment_value', ascending=False).reset_index()
In [361]:
fig, ax = plt.subplots(2,1, figsize=(25, 20))
sns.barplot(y='customer_state', x='order_item_id', data=sum_orders_items_state_df, ax=ax[0], hue="order_item_id")
ax[0].set_ylabel(None)
ax[0].tick_params(axis ='both', labelsize=14)
ax[0].set_xlabel('Sum of order items')
ax[0].set_title("Total Order Items Based on States", loc="center", fontsize=20)
sns.barplot(y='customer_state', x='payment_value', data=sum_orders_revenues_state_df, ax=ax[1], hue="payment_value")
ax[1].set_ylabel(None)
ax[1].set_xlabel('Sum of Revenues in millions')
ax[1].tick_params(axis ='both', labelsize=14)
ax[1].set_title("Total Revenues Based on States", loc="center", fontsize=20)
plt.show()
Pertanyaan 4: Metode pembayaran apa yang paling banyak jumlah pembayarannya dan sering digunakan dari berbagai states?¶
In [362]:
payment_type_df = all_df.groupby(['customer_state', 'payment_type'])['payment_value'].sum().unstack()
payment_type_df['total'] = payment_type_df.sum(axis=1)
payment_type_df = payment_type_df.sort_values(by='total', ascending=False).drop(columns='total')
payment_type_df.head()
Out[362]:
| payment_type | boleto | credit_card | debit_card | voucher |
|---|---|---|---|---|
| customer_state | ||||
| SP | 1455551.18 | 5612269.64 | 101543.62 | 152087.26 |
| RJ | 455897.67 | 2082298.82 | 41396.07 | 53389.33 |
| MG | 480977.97 | 1699006.55 | 25584.75 | 40050.10 |
| RS | 261426.66 | 794579.00 | 12315.64 | 25609.54 |
| PR | 219597.59 | 771443.51 | 10313.67 | 21443.90 |
In [363]:
transaction_counts_df= all_df.groupby(['customer_state', 'payment_type']).size().unstack()
transaction_counts_df['total'] = transaction_counts_df.sum(axis=1)
transaction_counts_df = transaction_counts_df.sort_values(by='total', ascending=False).drop(columns='total')
transaction_counts_df.head()
Out[363]:
| payment_type | boleto | credit_card | debit_card | voucher |
|---|---|---|---|---|
| customer_state | ||||
| SP | 9137.0 | 35240.0 | 829.0 | 2616.0 |
| RJ | 2405.0 | 11032.0 | 200.0 | 929.0 |
| MG | 2620.0 | 9835.0 | 148.0 | 615.0 |
| RS | 1548.0 | 4395.0 | 76.0 | 294.0 |
| PR | 1249.0 | 4185.0 | 72.0 | 284.0 |
In [364]:
from matplotlib.ticker import FuncFormatter
def format_with_units(x, pos):
if x >= 1e6:
return f'{x/1e6:.1f}M'
elif x >= 1e3:
return f'{x/1e3:.1f}K'
else:
return f'{x:.0f}'
payment_type_df.plot(kind='bar', stacked=True, figsize=(15,10))
plt.xlabel('Customer State')
plt.ylabel('Total Amount')
plt.title('Total Ammount Used on Payment Each Payment Type by Customer State')
plt.gca().yaxis.set_major_formatter(FuncFormatter(format_with_units))
transaction_counts_df.plot(kind='bar', stacked=True, figsize=(15,10))
plt.xlabel('Customer State')
plt.ylabel('Number of Transactions')
plt.title('Number of Orders for Each Payment Type by Customer State')
plt.show()
Insight:
- Ada kemungkinan tren musiman dengan puncak pada bulan-bulan tertentu bulan November 2017 dan Desember2017 serta penurunan pada bulan-bulan lain (seperti Februari atau Maret). Tren pendapatan seringkali sejalan dengan volume order, tetapi mungkin menunjukkan perbedaan tergantung pada promosi, kategori produk, atau order bernilai tinggi di bulan-bulan tertentu.
- bed_bath_table, furniture_decor, health beauty dan computer_accessories menjadi kategori 3 produk paling laris
Analisis Lanjutan (Opsional)¶
Pertanyaan 5: RFM Analysis¶
In [365]:
rfm_df = all_df.groupby(by="customer_id", as_index=False).agg({
"order_purchase_timestamp": "max", # mengambil tanggal order terakhir
"order_id": "nunique", # menghitung jumlah order
"payment_value": "sum" # menghitung jumlah revenue yang dihasilkan
})
rfm_df.rename(columns={
'payment_value' : 'monetary',
'order_id' : 'frequency',
'order_purchase_timestamp' : 'max_order_timestamp'
}, inplace=True)
# menghitung kapan terakhir pelanggan melakukan transaksi (hari)
rfm_df["max_order_timestamp"] = rfm_df["max_order_timestamp"].dt.date
recent_date = all_df["order_purchase_timestamp"].dt.date.max()
rfm_df["recency"] = rfm_df["max_order_timestamp"].apply(lambda x: (recent_date - x).days)
rfm_df.drop("max_order_timestamp", axis=1, inplace=True)
rfm_df.head()
Out[365]:
| customer_id | frequency | monetary | recency | |
|---|---|---|---|---|
| 0 | 00012a2ce6f8dcda20d059ce98491703 | 1 | 114.74 | 288 |
| 1 | 000161a058600d5901f007fab4c27140 | 1 | 67.41 | 409 |
| 2 | 0001fd6190edaaf884bcaf3d49edf079 | 1 | 195.42 | 547 |
| 3 | 0002414f95344307404f0ace7a26f1d5 | 1 | 179.35 | 378 |
| 4 | 000379cdec625522490c315e70c7a9fb | 1 | 107.01 | 149 |
In [366]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(30, 6))
sns.barplot(y="recency", x="customer_id", data=rfm_df.sort_values(by="recency", ascending=True).head(5), ax=ax[0])
ax[0].set_ylabel(None)
ax[0].set_xlabel(None)
ax[0].set_title("By Recency (days)", loc="center", fontsize=12)
ax[0].tick_params(axis ='x', labelsize=12, rotation=90)
sns.barplot(y="frequency", x="customer_id", data=rfm_df.sort_values(by="frequency", ascending=False).head(5), ax=ax[1])
ax[1].set_ylabel(None)
ax[1].set_xlabel(None)
ax[1].set_title("By Frequency", loc="center", fontsize=12)
ax[1].tick_params(axis='x', labelsize=12, rotation=90)
sns.barplot(y="monetary", x="customer_id", data=rfm_df.sort_values(by="monetary", ascending=False).head(5), ax=ax[2])
ax[2].set_ylabel(None)
ax[2].set_xlabel(None)
ax[2].set_title("By Monetary", loc="center", fontsize=12)
ax[2].tick_params(axis='x', labelsize=12, rotation=90)
plt.suptitle("Best Customer Based on RFM Parameters (customer_id)", fontsize=15)
plt.show()
Pertanyaan 6: Geospatial Analysis¶
Bagaimana jika distribusi customer berdasarkan lokasi geografis dalam bentuk map?
In [367]:
import pandas as pd
import folium
from folium.plugins import MarkerCluster
In [368]:
customer_geolocation_df = pd.merge(customers_df, geolocation_df,
left_on='customer_zip_code_prefix',
right_on='geolocation_zip_code_prefix')
In [369]:
customer_geolocation_df = customer_geolocation_df.groupby('customer_id').apply(lambda x: x.sample(1, random_state=42)).reset_index(drop=True)
C:\Users\Doni\AppData\Local\Temp\ipykernel_8616\4279034533.py:1: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
customer_geolocation_df = customer_geolocation_df.groupby('customer_id').apply(lambda x: x.sample(1, random_state=42)).reset_index(drop=True)
In [370]:
all_df = pd.merge(all_df, customer_geolocation_df, how='inner', on='customer_id')
In [371]:
all_df.columns
Out[371]:
Index(['index', 'order_id', 'customer_id', 'order_status',
'order_purchase_timestamp', 'order_estimated_delivery_date',
'delivery_time', 'customer_unique_id_x', 'customer_zip_code_prefix_x',
'customer_city_x', 'customer_state_x', 'status_x', 'payment_sequential',
'payment_type', 'payment_installments', 'payment_value', 'review_id',
'review_score', 'order_item_id', 'product_id', 'seller_id',
'shipping_limit_date', 'price', 'freight_value',
'product_category_name', 'product_name_lenght',
'product_description_lenght', 'product_photos_qty', 'product_weight_g',
'product_length_cm', 'product_height_cm', 'product_width_cm',
'product_category_name_english', 'seller_zip_code_prefix',
'seller_city', 'seller_state', 'customer_unique_id_y',
'customer_zip_code_prefix_y', 'customer_city_y', 'customer_state_y',
'status_y', 'geolocation_zip_code_prefix', 'geolocation_lat',
'geolocation_lng', 'geolocation_city', 'geolocation_state'],
dtype='object')
In [372]:
all_df.drop(columns=['customer_unique_id_y', 'customer_zip_code_prefix_y', 'customer_city_y', 'customer_state_y', 'status_y'], inplace=True)
In [373]:
all_df.rename(columns={
'customer_unique_id_x' : 'customer_unique_id',
'customer_zip_code_prefix_x' : 'customer_zip_code_prefix',
'customer_city_x' : 'customer_city',
'customer_state_x' : 'customer_state',
'status_x' : 'status',
}, inplace=True)
In [374]:
all_df['customer_city'] = all_df['customer_city'].astype('category')
all_df['order_status'] = all_df['order_status'].astype('category')
all_df['customer_state'] = all_df['customer_state'].astype('category')
all_df['payment_type'] = all_df['payment_type'].astype('category')
all_df['seller_city'] = all_df['seller_city'].astype('category')
all_df['geolocation_city'] = all_df['geolocation_city'].astype('category')
all_df['geolocation_state'] = all_df['geolocation_state'].astype('category')
all_df['status'] = all_df['status'].astype('category')
all_df['product_category_name'] = all_df['product_category_name'].astype('category')
all_df['product_category_name_english'] = all_df['product_category_name_english'].astype('category')
In [375]:
all_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 112912 entries, 0 to 112911 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 index 112912 non-null int64 1 order_id 112912 non-null object 2 customer_id 112912 non-null object 3 order_status 112912 non-null category 4 order_purchase_timestamp 112912 non-null datetime64[ns] 5 order_estimated_delivery_date 112912 non-null datetime64[ns] 6 delivery_time 112912 non-null float64 7 customer_unique_id 112912 non-null object 8 customer_zip_code_prefix 112912 non-null int64 9 customer_city 112912 non-null category 10 customer_state 112912 non-null category 11 status 112912 non-null category 12 payment_sequential 112912 non-null float64 13 payment_type 112912 non-null category 14 payment_installments 112912 non-null float64 15 payment_value 112912 non-null float64 16 review_id 112912 non-null object 17 review_score 112912 non-null float64 18 order_item_id 112912 non-null float64 19 product_id 112912 non-null object 20 seller_id 112912 non-null object 21 shipping_limit_date 112912 non-null object 22 price 112912 non-null float64 23 freight_value 112912 non-null float64 24 product_category_name 112912 non-null category 25 product_name_lenght 112912 non-null float64 26 product_description_lenght 112912 non-null float64 27 product_photos_qty 112912 non-null float64 28 product_weight_g 112912 non-null float64 29 product_length_cm 112912 non-null float64 30 product_height_cm 112912 non-null float64 31 product_width_cm 112912 non-null float64 32 product_category_name_english 112912 non-null category 33 seller_zip_code_prefix 112912 non-null int64 34 seller_city 112912 non-null category 35 seller_state 112912 non-null object 36 geolocation_zip_code_prefix 112912 non-null int64 37 geolocation_lat 112912 non-null float64 38 geolocation_lng 112912 non-null float64 39 geolocation_city 112912 non-null category 40 geolocation_state 112912 non-null category dtypes: category(10), datetime64[ns](2), float64(17), int64(4), object(8) memory usage: 28.4+ MB
In [376]:
customer_geolocation_df = customer_geolocation_df.sample(n=10000, random_state=42)
In [377]:
location_counts = customer_geolocation_df.groupby(['geolocation_lat', 'geolocation_lng', 'geolocation_city', 'geolocation_state']).size().reset_index(name='customer_count')
location_counts.head()
Out[377]:
| geolocation_lat | geolocation_lng | geolocation_city | geolocation_state | customer_count | |
|---|---|---|---|---|---|
| 0 | -33.527671 | -53.373545 | santa vitoria do palmar | RS | 1 |
| 1 | -32.204285 | -52.179458 | rio grande | RS | 1 |
| 2 | -32.181549 | -52.170984 | rio grande | RS | 1 |
| 3 | -32.110758 | -52.175673 | rio grande | RS | 1 |
| 4 | -32.095174 | -52.179897 | rio grande | RS | 1 |
In [378]:
center_lat = location_counts['geolocation_lat'].mean()
center_lng = location_counts['geolocation_lng'].mean()
m = folium.Map(location=[center_lat, center_lng], zoom_start=6)
In [379]:
marker_cluster = MarkerCluster().add_to(m)
In [380]:
for idx, row in location_counts.iterrows():
folium.CircleMarker(
location=[row['geolocation_lat'], row['geolocation_lng']],
radius=5, # Base size of the circle
popup=f"City: {row['geolocation_city']}, State: {row['geolocation_state']}<br>Customers: {row['customer_count']}",
color='blue',
fill=True,
fill_color='blue',
fill_opacity=0.7,
weight=2
).add_to(marker_cluster)
In [381]:
m
Out[381]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [382]:
m.save("simple_customer_map.html")
print("Simple customer map saved as 'simple_customer_map.html'")
Simple customer map saved as 'simple_customer_map.html'
Pertanyaan 7: Clustering¶
Bagaimana jika mengkategorikan customer berdasarkan jumlah total pengeluaran untuk pembelian produk di platform ini?
In [383]:
bins = [0, 50, 200, 500, 1000, 5000, float('inf')]
customer_value_labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High', 'Rich Loyalist']
rfm_df['spending_category'] = pd.cut(rfm_df['monetary'], bins=bins, labels=customer_value_labels, right=False)
In [384]:
rfm_count = rfm_df['spending_category'].value_counts().reset_index()
In [386]:
bins = [0, 50, 200, 500, 1000, 5000, float('inf')]
customer_value_labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High', 'Rich Loyalist']
rfm_df['spending_category'] = pd.cut(rfm_df['monetary'], bins=bins, labels=customer_value_labels, right=False)
rfm_count = rfm_df['spending_category'].value_counts().reset_index()
rfm_count = rfm_count.sort_values(by='count', ascending=False)
rfm_count.head()
Out[386]:
| spending_category | count | |
|---|---|---|
| 0 | Low | 55629 |
| 1 | Medium | 16330 |
| 2 | Very Low | 15831 |
| 3 | High | 4474 |
| 4 | Very High | 2103 |
In [387]:
all_df.to_csv('all_df.csv')
Conclusion¶
Conclusion pertanyaan 1¶
Jumlah Order: Ada kemungkinan tren musiman dengan puncak pada bulan-bulan tertentu bulan November atau musim liburan di bulan Desember dan penurunan pada bulan-bulan lain, seperti Februari atau Maret.
Total Revenue: Tren pendapatan seringkali sejalan dengan volume order, tetapi mungkin menunjukkan perbedaan tergantung pada promosi, kategori produk, atau order bernilai tinggi di bulan-bulan tertentu.
Conclusion pertanyaan 2¶
Produk Terbaik (Revenue): Kategori seperti elektronik, ponsel, atau peralatan rumah tangga cenderung menghasilkan pendapatan lebih tinggi karena harga per unit yang lebih tinggi.
Produk Terlaris (Volume Order): Kategori hampir sama dengan produk terbaik secara revenue mungkin terjual dalam jumlah lebih banyak tetapi dengan harga yang lebih rendah.
Produk Terburuk (Revenue & Volume): Barang-barang khusus atau produk-produk niche, terutama dalam kategori barang mewah, mungkin memiliki volume penjualan dan pendapatan yang lebih rendah.
Conclusion pertanyaan 3¶
State Terbaik: São Paulo, Rio de Janeiro, dan Minas Gerais biasanya merupakan state dengan pendapatan dan volume order tertinggi karena populasinya yang besar dan infrastruktur e-commerce yang lebih berkembang.
State Terburuk: State yang kurang padat penduduknya atau yang infrastrukturnya kurang berkembang, seperti Acre atau Roraima, atau yang di dekat hutan amazon, mungkin menunjukkan volume order dan pendapatan yang lebih rendah.
Conclusion pertanyaan 4¶
Metode Paling Populer: Pembayaran dengan kartu kredit biasanya paling umum di berbagai state karena fleksibilitasnya untuk pembayaran secara cicilan.
Conclusion pertanyaan 5¶
RFM Tinggi: Pelanggan setia yang sering membeli dan menghabiskan banyak uang serta recency yang kecil
RFM Rendah: Pelanggan yang tidak melakukan pembelian baru-baru ini, tidak membeli secara sering, atau menghabiskan lebih sedikit.
Hasilnya berupa segmentasi pelanggan
Conclusion pertanyaan 6¶
Pelanggan dengan pengeluaran tinggi (High-Rich Loyalist) yang kemungkinan lebih makmur dan mungkin membeli barang-barang bernilai lebih tinggi seperti elektronik atau barang mewah.
Pengeluaran menengah (Medium) yang membeli barang-barang dengan harga sedang, mungkin fokus pada produk rumah tangga atau perawatan pribadi.
Pengeluaran rendah (Very Low - Low) yang mungkin hanya membeli barang-barang murah dan sering dikonsumsi seperti produk kecantikan atau kebutuhan sehari-hari.
Conclusion pertanyaan 7¶
Sebagian besar pelanggan terkonsentrasi di daerah metropolitan, terutama di sekitar São Paulo dan Rio de Janeiro.
Mungkin ada wilayah yang kurang terjangkau (misalnya di bagian Barat Laut Brasil) di mana basis pelanggan jarang karena keterbatasan logistik atau hambatan ekonomi.